ABSTRACT Clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated (cas) genes constitute
the CRISPR-Cas systems found in the Bacteria and Archaea domains. At least in some strains they provide an efficient barrier 
against transmissible genetic elements such as plasmids and viruses. Two CRISPR-Cas systems have been identified in Esche- 
richia coli, pertaining to subtypes I-E (cas-E genes) and I-F (cas-F genes), respectively. In order to unveil the evolutionary dy- 
namics of such systems, we analyzed the sequence variations in the CRISPR-Cas loci of a collection of 131 E. coli strains. Our 
results show that the strain grouping inferred from these CRISPR data slightly differs from the phylogeny of the species, suggest- 
ing the occurrence of recombinational events between CRISPR arrays. Moreover, we determined that the primary cas-E genes of 
E. coli were altogether replaced with a substantially different variant in a minor group of strains that include K-12. Insertion 
elements play an important role in this variability. This result underlines the interchange capacity of CRISPR-Cas constituents 
and hints that at least some functional aspects documented for the K-12 system may not apply to the vast majority of E. coli 
strains. 

IMPORTANCE Escherichia coli is a model microorganism for the study of diverse aspects such as microbial evolution and is a
component of the human gut flora that may have a direct impact on everyday life. This work was undertaken with the purpose of
elucidating the evolutionary pathways that have led to the present situation of its significantly different CRISPR-Cas subtypes (I-E
and I-F) in several strains of E. coli. This information offers a novel and wider understanding of the variety and
relevance of these regions within the species. This knowledge may therefore help researchers use
these systems for typing purposes and predict their behavior in strains that, depending on their particular genetic
makeup, would show different levels of immunity to foreign genetic elements.
Clustered regularly interspaced short palindromic repeat
(CRISPR)-Cas systems consist of two main functional com- 
ponents: (i) at least one cassette of DNA repeats regularly spaced 
by unique sequences called spacers (1) and (ii) a set of genes 
named cas (for "CRISPR associated") (2). Although these systems 
have tentatively been involved in disparate functions (3-6), it was 
recently demonstrated that they form part of the diverse reper- 
toire of tools utilized by prokaryotic microorganisms to prevent 
infection by foreign DNA (7, 8). 

CRISPR-Cas systems interfere with genetic invaders by an un- 
precedented mechanism (9, 10). Interference is achieved after the 
CRISPR arrays are transcribed and the resulting pre-CRISPR 
RNAs (pre-crRNAs) processed by specific Cas proteins into 
monospacer crRNA molecules (11, 12). Afterward, crRNAs hy- 
bridize with complementary sequences and, concomitantly, a Cas 
endonuclease cuts within the target, leading to its degradation (9, 
13). During invasion, new spacers are incorporated into the host 
CRISPR arrays, providing adaptive immunity (7, 14, 15). As a 
result of this adaptation, the spacer identities and numbers may 
largely differ between strains, reflecting the diverse previous
encounters and activity of the acquisition machinery (16-18).

Escherichia coli isolates may carry two different CRISPR-Cas
systems (19, 20) that belong to either subtype I-E or subtype I-F
(21). The components of the I-E system are split between two loci, 
CRISPR-I and CRISPR-II, flanked by the iap and cysH genes and 
the ygcE and ygcF genes, respectively. In the CRISPR-I locus, there 
is a cassette of type 2 repeats (22), termed the CRISPR2.1 array, 
and a set of eight cas-E genes (namely, cas2, cas1, cas6e, cas5, cas7,
cse2, cse1, and cas3). The proteins encoded by cas6e, cas5, cas7,
cse2, and cse1 make the Cascade complex, which generates the
crRNA molecules (9, 11). Either one repeat array (CRISPR2.2-3) 
or two repeat arrays (CRISPR2.2 and CRISPR2.3, separated by 
0.5 kb) can be present in the CRISPR-II locus. Next to both 
CRISPR2.1 and CRISPR2.3 cassettes, there are leader sequences 
(23) that harbor the promoters for their transcription (24, 25, 26). 
The I-F system is located between the clpA and infA genes in the 
locus referred to as CRISPR-III. This locus consists of up to two 
arrays of type 4 repeats (22), consequently called CRISPR4.1 and 
CRISPR4.2, and an operon of 6 cas-F genes (namely,
cas6f-csy3-csy2-csy1-cas2-cas3-cas1). Leader sequences are observed
adjoining each repeat array (19). When only one cassette is found in
this locus (which implies absence of cas genes), it is named 
CRISPR4.1-2. Although all E. coli strains analyzed so far carry at 
least one CRISPR4 repeat unit, just a few bear cas-F genes, usually 
representing the only cas in the cell (only a single strain has been
identified so far carrying I-E and I-F cas genes [19]). In contrast, 
most E. coli strains harbor a complete I-E system and only certain 
clonal groups entirely lack it, albeit there are a variety of interme- 
diate situations (19, 27). 

While some bacteria harbor active CRISPR-Cas systems that 
efficiently prevent lateral gene transfer and that are primarily in- 
volved in defense (7, 13, 28), these roles are not evident in E. coli 
(20, 27, 29). For instance, in K-12-derivative strains, the CRISPR- 
Cas I-E system is almost completely silenced under normal labo- 
ratory growth conditions (24, 25). Moreover, the diversity of spac- 
ers encountered in the species is considered reduced compared to 
what would be expected for a conventional immune system (20). 
Yet the pervasiveness of these systems in E. coli (19) suggests that 
they must provide an advantage for the cell. 

In this work, we conducted evolutionary analyses of the 
CRISPR loci in a set of available E. coli sequences and a collection 
of isolates (ECOR collection [30]). ECOR strains were included in
the study because they represent much of the species' genetic
variability and their phylogeny has been well established, defining up
to six groups (namely, A, B1, B2, D, E, and F [31]). Also included
were sequences from different Shigella species, a polyphyletic ge- 
nus whose members can be considered highly specialized patho- 
genic E. coli strains (32-34). 

This report provides a comprehensive view of the E. coli 
CRISPR-Cas regions that will be of utility for functional and typ- 
ing studies. We obtained data supporting the replacement of a 
complete set of cas-E genes, together with the associated leader, 
with a minor variant represented by the profusely characterized 
K-12 system. Furthermore, in line with previous suggestions (35, 
36), we provide consistent data to confirm that the dynamics of 
the CRISPR-Cas systems is greatly affected, at least in the case of 
cas-E, by insertion elements as driving forces implicated in their 
evolution. 

RESULTS 

Evolutionary dynamics of the CRISPR arrays across E. coli. In
order to determine how the CRISPR arrays have evolved in E. coli,
a binary clustering analysis of the spacer content was performed in
a panel of 131 isolates and the corresponding tree was constructed
as described in Materials and Methods. In brief, for each CRISPR 
array, strains were classified into groups (denoted spacer groups, 
or "SGs") depending on the identity of the spacers in that array: 
strains within a given SG shared at least one spacer with other 
member(s) of the group. Concurring with previous observations 
(19, 24), the majority of shared spacers were located at the leader- 
distal end of the array (data not shown). A total of 14 SGs were 
identified for CRISPR2.1, 12 for CRISPR2.3, 3 for CRISPR2.1-2, 
and 2 for both CRISPR4.1 and CRISPR4.2 arrays (Table S1). Every
SG was considered a character, and the combined binary data for
all the arrays of each strain were used to generate a spacer-based
tree (see Fig. S1 in the supplemental material). Given that multilocus
sequence typing (MLST) produces the most commonly accepted
phylogenetic reconstruction of the species (31, 37, 38), an
MLST tree was generated for the same strains (see Fig. S2 in the 
supplemental material) and the ultrametric matrices of the two 
trees were compared with the use of the COPH and MXCOMP 
programs (NTSYSpc 2.0 package). The correlation value obtained 
was 0.517 for a cutoff of 0.083 at a significance level of 0.05; hence,
the two trees were considered comparable (39).

FIG 1 Trees from cas-F genes (A) and MLST data (B) of sequenced E. coli and
Shigella strains. Yersinia pestis serovar Angola and E. fergusonii ATCC 35469
were used as outgroups. Both sets of genes were concatenated and aligned
separately. Colors indicate the MLST group (green, B1; red, B2). Shigella sp.
strain D9 is abbreviated as SspD9.

Nevertheless, as
previously observed in a similar analysis performed with another
set of strains (40), some clusters of the spacers tree did not
conform to the MLST groups (see Table S1 in the supplemental
material and compare Fig. S1 with Fig. S2), suggesting that
recombinational events between CRISPR loci of different strains might
have taken place. Notably, Escherichia fergusonii ATCC 35469 
shares a CRISPR2.1 spacer with E. coli K-12-MG1655 (here re- 
ferred to as K-12) and related strains, despite both species having 
diverged about 46 million years ago (Mya) (41). This observation 
further supports the occurrence of spacer exchange even among 
phylogenetically distant strains. Related to the variation at the 
spacer group level, a great divergence in the number of common 
spacers that constitute the SGs for the I-E system in the different 
clonal groups was found (ranging from 1 to 40; results not 
shown). Usually, strains belonging to SGs with the lowest num- 
bers of spacers also corresponded to basal (i.e., earlier to diverge) 
MLST groups (data not shown). In this sense, MLST groups A and 
B1 separated more recently from the basal B2, D, E, and F groups
(see Fig. S2). Hence, it might be possible that, during the phylo- 
genetic evolution of the species, a substantial change occurred that 
was reflected in these variations. 
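The spacer-group assignment and tree comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes that an SG is a connected component of strains linked by shared spacers, that a strain with spacers shared by no other strain forms its own SG, and that the matrix comparison can be approximated by a plain Pearson correlation between corresponding entries of two distance matrices (no significance test, and the derivation of ultrametric matrices from the trees is not shown). All function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def spacer_groups(arrays):
    """Cluster strains into spacer groups (SGs) for one CRISPR array.

    arrays maps a strain name to the list of spacer identifiers in that array.
    Strains end up in the same SG when they are linked by shared spacers,
    directly or through other strains (union-find over shared spacers).
    """
    parent = {strain: strain for strain in arrays}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_owner = {}
    for strain, spacers in arrays.items():
        for spacer in spacers:
            if spacer in first_owner:
                parent[find(strain)] = find(first_owner[spacer])
            else:
                first_owner[spacer] = strain

    groups = {}
    for strain, spacers in arrays.items():
        if spacers:  # assumption: strains with an empty array belong to no SG
            groups.setdefault(find(strain), []).append(strain)
    return sorted(sorted(members) for members in groups.values())


def binary_characters(groups, strains):
    """One 0/1 character per SG: 1 if the strain is a member of that SG."""
    return {s: [int(s in g) for g in groups] for s in strains}


def matrix_correlation(d1, d2, names):
    """Pearson correlation between two distance matrices keyed by frozenset pairs."""
    pairs = [frozenset(p) for p in combinations(names, 2)]
    x = [d1[p] for p in pairs]
    y = [d2[p] for p in pairs]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Hypothetical toy data, not taken from Table S1.
toy = {"s1": ["a", "b"], "s2": ["b", "c"], "s3": ["c"], "s4": ["x"], "s5": []}
groups = spacer_groups(toy)
chars = binary_characters(groups, toy)
```

In the toy data, s1 and s3 share no spacer directly but fall in the same SG through s2, which mirrors the statement that members share at least one spacer with other member(s) of the group.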

Evolution of cas-E and cas-F genes. The evolutionary relationships
of the E. coli cas genes were inferred by considering the panel
of sequenced strains studied in this work. The phylogenetic trees
were obtained after alignments performed with concatenated
sequences of the cas-F genes (10 sequences; Fig. 1A) or cas-E genes
(37 sequences; Fig. 2A). Each of these cas trees was compared with
an MLST tree (Fig. 1B and 2B) built with the corresponding
strains. As a result, different degrees of correlation were observed
depending on the cas subtype. Whereas for cas-F genes the match
was almost complete (Fig. 1), the correspondence with phylogeny
was more discordant in the case of cas-E (Fig. 2A and B).

FIG 2 Trees from cas-E genes (A), MLST data (B), and cas1-cas2-cas3 sequences (C) of sequenced E. coli and Shigella strains. E. fergusonii ATCC 35469, E. albertii
TW07627, and S. enterica serotype Choleraesuis SC-B67 were used as outgroups. All sets of genes were concatenated and aligned separately. Colors denote the
MLST group (dark blue, A; green, B1; fuchsia, D; light blue, E; dark red, F). Two separate columns to the right of each tree indicate the distinctive spacer groups
for CRISPR2.1 (2.1) and CRISPR2.3 (2.3) arrays. Shigella strains are designated with an abbreviation (Sb, S. boydii; Sd, S. dysenteriae; Sf, S. flexneri; Ss, S. sonnei;
ssp., Shigella sp.) and a specific code as indicated in Table S1 in the supplemental material. B1 strains with a complete set of cas-E genes showing a significantly
altered topology with respect to the MLST phylogeny are boxed in red.



Notably, a cluster (here referred to as the E2 cluster) composed of
few strains (see Fig. 2A) was separated from a major one (named
E1 cluster) that included two non-E. coli strains used as outgroups
(Salmonella enterica subsp. enterica serotype Choleraesuis SC-B67
and Escherichia albertii TW07627) (Fig. 2B). The E2 cluster
comprised some MLST A strains, a group E strain (S. dysenteriae
Sd197), and the other outgroup included in the analysis (E. fergusonii
ATCC 35469; Fig. 2A). These results are compatible with a
different origin for each cluster. In addition, a clade of some MLST
B1 strains (including B171 among others) branched at a position
that was more basal than that defined by MLST, suggesting that
recombination events could have taken place at the cas region.

A similar phylogenetic analysis considering only the genes that
encode the Cascade region (9) revealed the same altered topology
found in Fig. 2A (results not shown). Conversely, when the
cas1-cas2-cas3 genes were aligned, a better correlation of the resulting
cas tree (Fig. 2C) with the phylogeny of the species (Fig. 2B) was
observed for these discrepant B1 strains, thus indicating that
Cascade-encoding genes are more prone to variation than the
cas1-cas2-cas3 genes, at least in these strains. However, E1 and E2
clusters were still defined as separated clades, providing further
support to the idea of a distinct origin of at least non-Cascade
genes.

To determine if recombination could have accounted for the
tree discrepancies observed in the case of the B171 clade of B1
strains, a prediction analysis of such events was carried out with
the cas-E1 genes (see Fig. S3 in the supplemental material) using
the GENECONV program (see Materials and Methods). The results
confirmed that, concurring with the phylogenetic trees,
recombination at the Cascade region was more relevant for the
above-mentioned B1 cluster. Unsurprisingly, this recombination
appeared to be frequent among closely related strains, especially
within the B1 group. Nevertheless, this higher occurrence within
them was partially due to their prevalence among the strains under
study. In contrast, no recombination was detected for S. enterica,
and only two such events were found in E. albertii, both
with D group strain 042, thus evidencing that recombination was
favored among strains sharing close phylogenetic relationships.
These results indicated that the phylogenetic variability of the
cas-E systems was driven not only by genetic drift (i.e., point
mutations) but also by partial gene substitutions, at least within
certain strains.

To further examine the possible impact on phylogeny of this
higher variability at the Cascade level, we calculated the values of
the codon adaptation index (CAI) as a measure of the codon usage
bias of a sequence in relation to its genetic environment (42).
These studies were performed for the corresponding cas genes of
those MLST B1 strains represented in Fig. S3 in the supplemental
material, including B171 and E22 (results not shown). Remarkably,
statistically significant differences were observed for the CAI
values of B171 and E22 with respect to the other strains considered.
This divergence was found not only for cse1, cse2, and cas7
(comprising the bulk of the Cascade genes) but also for cas3,
despite the latter not disrupting the established MLST topology
(Fig. 2C). Whereas CAI values for cas3 were higher in both B171
and E22 (possibly reflecting a more optimal gene expression), a
lower value was found in the case of cse1, cse2, and cas7, perhaps
hinting at a functional correlation between these two sets of genes.
Moreover, no major differences in CAI values were found for
Cascade genes cas5 and cas6e, as well as cas1 and cas2, thus
suggesting a CAI-polarized divergence of the cas genes depending on
the strains.
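For readers unfamiliar with the index, a CAI calculation can be sketched as below. This follows the commonly used definition of CAI as the geometric mean of the relative adaptiveness of each codon with respect to a reference gene set; whether the software used in this study applied exactly the same reference set and zero-count handling is not stated here, so the 0.5 pseudocount for unobserved codons, the function names, and the toy sequences are assumptions made for illustration only.

```python
from itertools import product
from math import exp, log

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split an in-frame coding sequence into codons, dropping a trailing partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """w(codon) = count / count of the most used synonymous codon in the reference set.

    Met, Trp, and stop codons are excluded. Unobserved codons get a count of 0.5
    (an assumption of this sketch) so that log(w) stays defined.
    """
    counts = dict.fromkeys(CODON_TABLE, 0)
    for gene in reference_genes:
        for codon in codons(gene):
            if codon in counts:
                counts[codon] += 1
    weights = {}
    for aa in set(CODON_TABLE.values()) - {"M", "W", "*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        top = max(max(counts[c] for c in family), 0.5)
        for c in family:
            weights[c] = max(counts[c], 0.5) / top
    return weights


def cai(gene, weights):
    """Geometric mean of w over the scored codons of a gene."""
    logs = [log(weights[c]) for c in codons(gene) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return exp(sum(logs) / len(logs))


# Hypothetical reference set in which CTG is used three times more often than CTA.
w = relative_adaptiveness(["CTGCTGCTGCTA"])
```

With these weights, a gene made only of the preferred codon (for example `cai("CTGCTG", w)`) scores 1.0, and replacing one CTG with CTA lowers the value, which is the sense in which lower CAI values are read as weaker adaptation to the reference codon usage.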

Despite these differences found as the consequence of both
genetic drift and recombination, the most significant discordance
between cas-E and MLST groupings corresponded to the segregation
of the E2 cluster in the cas tree (Fig. 2). Analyses of the codon
usage and the guanine and cytosine (GC) content of the cas genes
could provide further insight into their origin and evolution. These
two parameters are linked to the adaptation of specific sequences
to their genetic context (43-45) and had been previously reported
in certain CRISPR systems unrelated to E. coli (35). First, the GC
content of E1 and E2 genes was calculated to assess whether these
two variants could have been independently acquired. In agreement
with this possibility, the mean GC percentage of E1 genes
was substantially different from that of the E2 group (53.5% versus
45.6%). Moreover, these values contrasted with those of cas-F genes
(50.5%) and the K-12 genome (50.7%), suggesting a more recent
acquisition of the cas-E genes. To corroborate this observation, a
similarity tree based on codon usage frequencies was constructed
(Fig. 3). Whereas cas-F genes clustered along with the genome, the
cas-E genes split into two main groups corresponding to E1 and
E2. The robustness of this grouping with respect to the K-12 genome
was statistically supported. Remarkably, the E2 genes were
the more dissimilar with respect to the genome. Taking these results
together, it can be concluded that E1 and E2 constitute
substantially distinct cas-E variants in E. coli, probably having
been incorporated into the genome after the cas-F genes, with E2
as the most recent acquisition.
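The GC comparison above amounts to a simple per-gene calculation followed by averaging. The sketch below assumes that the mean is taken over per-gene percentages (the text does not say whether genes were averaged or concatenated) and that ambiguous bases are ignored; function names and sequences are hypothetical.

```python
def gc_percent(seq):
    """GC content of a sequence as a percentage of its unambiguous bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def mean_gc(genes):
    """Mean of per-gene GC percentages (assumption: genes are weighted equally)."""
    values = [gc_percent(gene) for gene in genes]
    return sum(values) / len(values)


# Hypothetical toy gene sets, only to show the call pattern.
set_one = ["GGCCGCAT", "GCGCATGC"]
set_two = ["ATATGCAT", "AATTGCAA"]
difference = mean_gc(set_one) - mean_gc(set_two)
```

A gene set whose mean GC percentage departs clearly from the host genome value, as reported here for E2 (45.6% versus 50.7% for K-12), is the kind of signal that is usually read as incomplete adaptation to the genomic context.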

The occurrence of each cas-E variant in the ECOR collection
strains was determined by PCR performed with a primer matching
variant-specific cas3 sequences, along with a second primer
annealing with a conserved region downstream of cysH (see Materials
and Methods). Amplification of a strain with either
cas3E1-R or cas3E2-R primers would imply that it harbors the
corresponding variant of cas3 and, presumably, of the rest of the
cas-E genes. Only strains yielding amplification for one variant
and not the other were further considered (see Table S1 in the
supplemental material). The results of these amplifications, together
with data of sequenced strains, showed that cas-E2 genes
are present in most E-carrying strains from the MLST group A
(with the exceptions of EC01, EC06, EC24, UMNF18, and SI 191),
a strain from group E (Sd197), and E. fergusonii ATCC 35469.

Diversity of CRISPR regions. In order to gain further insight 
into the origin and evolution of the E. coli CRISPR-Cas systems, a 
survey of the iap-cysH, ygcE-ygcF, and clpA-infA regions, harbor- 
ing CRISPR-I, CRISPR-II, and CRISPR-III loci, respectively, was 
performed. 

The diversity of genetic elements in the CRISPR-I locus (containing
the cas-E genes and CRISPR2.1 array) was determined for
sequenced E. coli and related strains. First, we noticed that the
leader of CRISPR2.1 was linked to the presence of cas genes. Moreover,
the leader sequences were related to the specific gene variants
(see Fig. S4 in the supplemental material) and are therefore
referred to as L1 and L2, for the associated gene variants E1 and E2,
respectively.

In addition to the leader and cas-E variants, the occurrence of
diverse insertion sequences (IS), a hok sok system, and conserved
intergenic regions (IG) was considered to establish the diversity of
CRISPR-I arrangements. Nine main organizations (named A to I;
Fig. 4) were defined. The arrangement affiliations thus established
for sequenced genomes and those inferred for ECOR strains
in accordance with the size of CRISPR-I PCR amplicons (see Materials
and Methods) and previous data (19) are indicated in Table S1.
Specific genetic elements were associated with the absence
of cas-E genes or the presence of a given cas/leader variant. For
instance, all E. coli strains carrying E1 were linked to the presence
of IG4 and the E2 variant was invariably associated with IG5 (i.e.,
both elements are absent in strains lacking cas-E genes; Fig. 4).
Moreover, the hok sok toxin-antitoxin system (46) is present in all
strains but those harboring E2, and IG3 was found in E1 strains and
strains without cas-E genes, except for the G arrangement (Fig. 4).
Taking into account the presence or absence of these genetic elements
in the analyzed strains as well as predicted intermediate
situations (see arrangements AH1 and AH2 in Fig. 4), a scenario of
cas-E deletion or acquisition in E. coli could be inferred. Figure 4
shows the most parsimonious order of events concurring with the
phylogeny of the species. Arrangement A, which is present in
strains of all phylogenetic groups but B2 (with the exception of the
highly modified case A5; Fig. 5), could be considered the most
ancient (i.e., cas-E1). Deletion of the IG6-IG3 intervening region
would give rise to arrangement B and a later insertion of ISu
(uncharacterized IS) in case C. In unrelated events, the insertion of
IS186 and ISu into case A strains would lead to arrangements D, E,
and F and the predicted hypothetical cases AH1 and AH2. The subsequent
recombination between equivalent IS elements in AH1
and AH2 cases would have produced removal of the L1 leader and
the cas-E1 genes and partial depletion of the CRISPR cassette, as
observed, for example, in case G. Next, a module composed of the
IG5 region, a complete set of cas-E2 genes, and an L2 leader would
be inserted in AH2, leading to case H. Moreover, the conservation
of the CRISPR region next to IG6 in both cas-E1- and cas-E2-harboring
strains and the presence in the leader-proximal region
of L2-specific CRISPRs (19) strongly suggest that the insertion
module also carried at least one CRISPR unit and that the integration
took place at the CRISPR array. Finally, the I arrangement
could have been generated after insertion of an IS186 element
within the CRISPR cassette of an H strain. Some authors have
previously suggested a possible role of ISs in the evolution of the
CRISPR-Cas systems, merely based on the incidence at these regions
of those elements (35, 36). Here, we provide solid evidence
supporting that interpretation.

Minor arrangements derived from the major ones were ob- 
served, the majority of them involving additional ISs that presum- 
ably promoted further deletions (Fig. 5). In agreement with the 
higher incidence of this sort of element in Shigella genomes than in 
E. coli genomes (34, 47, 48), the occurrence of these variants was 
prominent in the former. 

results from C. Diez-Villasenor. The arrangements of E. albertii TW07627 (E.al) and S. enterica Choleraesuis SC-B67 (S.en), with their distinctive regions, are also
included.

In the case of the CRISPR-II locus (carrying CRISPR2.2 and 2.3
arrays), no significant variations in the CRISPR flanking regions
of the sequenced strains were found (data not shown), aside from
the sporadic presence of insertion elements, mostly in Shigella
strains, that had no apparent impact on the evolution of this region.
Moreover, in contrast to CRISPR2.1, the leaders of
CRISPR2.3 (leader 2.3) were similar in all cases, in both cas-E1-
and cas-E2-carrying strains, as well as in those lacking these genes
(see Fig. S5 in the supplemental material). In addition, many strains
pertaining to the same SG for the CRISPR2.3 array possessed either
the I-E1 or the I-E2 variant, as seen in Table S1. These observations
strongly suggest that the CRISPR-II locus was already
present when the replacement of the leader-cas elements in
CRISPR-I took place and that the two CRISPR I-E loci have
evolved in distinct manners.

As in the case of CRISPR-II, the variability of the CRISPR-III
locus (I-F system region) was limited and no IS was usually present
(see Fig. S6 in the supplemental material), the only exception
being the B7A strain, with IS100 located between the cas-F genes
and the leader of CRISPR4.2. Thus, in agreement with what was
observed at the level of cas-F genes and CRISPR4 arrays, this region
is more clonal than the I-E counterpart.

Timeline of acquisition of the CRISPR-Cas systems. The existence
of two distinct CRISPR-Cas subtypes in E. coli, one of them
presenting two alternative variants, raises the issue of when they
were incorporated into the genome. Evidence of the presence of
equivalent I-E and I-F systems in the corresponding loci of E. coli,
E. fergusonii, E. albertii, and S. enterica strains (Fig. 5; see also
Fig. S6 in the supplemental material and reference 19) denotes an
ancient and widespread occurrence. As seen above, frequencies of
codon usage and GC content of cas genes indicated that the I-F
system of E. coli was more related to its genomic environment
whereas the E2 variant would be the most dissimilar (Fig. 3). These
two parameters can be used as indicators of the relative times of
acquisition of genes: after a novel sequence is incorporated into a
genome, it gradually adapts these features to match those of its
surroundings in the process known as amelioration (40, 45).
Hence, the more ameliorated I-F system would have been the first
to incorporate into the genome, followed by I-E1; more recently,
I-E2 replaced I-E1 in some strains. Against this line of
reasoning, the higher concordance of I-F with the genome could
be merely interpreted as the result of this system evolving more
rapidly toward homogenization. Yet the inferred mutation rates
for the sets of cas-E and cas-F genes, as calculated from their CAI
values (42, 49), showed no statistically significant difference between
cas-E1 and cas-F, therefore arguing against I-F evolving
faster. Moreover, the codon usage frequencies of the E. coli K-12
genome (a strain with cas-E2 genes and no cas-F) that were included
in the corresponding tree (Fig. 3) grouped with cas-F. Also
in line with these observations, the above-mentioned differences
at the cas3-cysH intergenic regions among strains, with some IGs
associated with certain cas-E variants (Fig. 4; see also Fig. S4 and S5
and Table S1 in the supplemental material), further supported this
order of incorporation.
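The amelioration argument rests on how closely the codon usage of each cas gene set matches that of the host genome. A rough way to express this is shown below: codon frequency vectors are computed for each sequence set and ranked by distance to the genome vector. The Euclidean distance, the function names, and the toy sequences are assumptions of this sketch; the metric and clustering method actually used to build the similarity tree in Fig. 3 are not specified in this section.

```python
from itertools import product
from math import sqrt

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(seq):
    """Relative frequency of each of the 64 codons in an in-frame sequence."""
    seq = seq.upper()
    counts = dict.fromkeys(CODONS, 0)
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon in counts:  # codons with ambiguous bases are skipped
            counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no complete unambiguous codons in sequence")
    return [counts[c] / total for c in CODONS]


def euclidean(p, q):
    """Euclidean distance between two frequency vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rank_by_similarity(reference, gene_sets):
    """Names of gene sets ordered from most to least similar to the reference."""
    ref = codon_frequencies(reference)
    return sorted(gene_sets, key=lambda name: euclidean(ref, codon_frequencies(gene_sets[name])))


# Hypothetical sequences: set_a uses the same codons as the reference, set_b does not.
genome = "GCGGCGCTGCTG"
sets = {"set_b": "GCAGCATTATTA", "set_a": "GCGCTGGCGCTG"}
order = rank_by_similarity(genome, sets)
```

Under the reasoning in the text, the set ranked closest to the genome would be interpreted as the most ameliorated and hence the earliest acquisition, provided the sets evolve at comparable rates.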

The order of acquisition of I-E and I-F systems in E. coli has
been previously addressed based on their presence or absence in
related Salmonella and Escherichia species (50), suggesting that the
I-E system preceded the I-F system. In our study, we considered a
more comprehensive set of data that allows a more parsimonious
explanation. In particular, two I-E variants have now been recognized,
additional species and a larger number of strains have been
included, and parameters such as codon usage and GC content
have been considered. As mentioned above, on the basis of the GC
content and codon usage values, and supported by the association
of certain IGs with specific cas-E variants, we propose that
I-F is the more ancestral of the two systems. Afterward, I-E1 followed
and was later replaced in certain strains by I-E2. Thus, we
considered the mean divergence time inferred from the most distant
(basal) strains in possession of the I-F, I-E1, or I-E2 system to
calculate their time of acquisition. This divergence was determined
by taking into account a rate of mutation of 6.86 × 10^-10
substitutions per year, estimated from a mean 6.31% difference in
MLST distances between the E. coli strains and their closest outgroup
(E. fergusonii), the two species having separated 46 Mya
(41). Thus, we concluded that the extant I-F was acquired some
20.5 ± 0.25 Mya, followed by I-E1 at about 19.5 ± 0.8 Mya, the
latter then being replaced with I-E2 at approximately 14.1 ±
0.6 Mya, perhaps from an unknown donor that also transferred it
to E. fergusonii.
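The quoted rate can be reproduced from the two figures given in the text if the 6.31% divergence is assumed to have accumulated along both lineages since the split, that is, rate = 0.0631 / (2 × 46 × 10^6 years). The sketch below makes that assumption explicit and applies the same strict-clock relation to convert a pairwise MLST distance into a divergence time; the function names are hypothetical, and the pairwise distances behind the published acquisition dates are not given in this section.

```python
def substitution_rate(divergence_fraction, split_years):
    """Rate per lineage per year, assuming both lineages diverged for split_years."""
    return divergence_fraction / (2.0 * split_years)


def divergence_time(pairwise_distance, rate):
    """Years since two lineages split, given their pairwise distance and a strict clock."""
    return pairwise_distance / (2.0 * rate)


# E. coli versus E. fergusonii: 6.31% MLST difference, split 46 Mya (values from the text).
rate = substitution_rate(0.0631, 46e6)  # about 6.86e-10, matching the quoted rate

# Sanity check: feeding the calibration distance back returns the calibration date.
calibration_years = divergence_time(0.0631, rate)
```

Under the same relation, a pair of basal strains separated by about 2.8% MLST distance would date to roughly 20.5 Mya; this figure is a back-calculation from the published date under the stated assumption, not a value reported in the study.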

DISCUSSION 

Multiple levels of genetic exchange and variability at the
CRISPR-Cas regions. The two CRISPR-Cas systems present in
E. coli have experienced recombination and gene exchange to different
degrees. While cas-F genes mainly follow the clonal frame
characteristic of most genes within the species (31, 37, 38), cas-E
genes are considerably prone to genetic exchange. This variation
involving the cas-E genes is more frequent in strains ascribed to
certain clonal MLST groups (e.g., A and B1). Also, it extends to the
adjacent CRISPR array and the intergenic regions, where
CRISPR2.1 shows a higher diversity (i.e., more SGs, less clonally
related with the MLST phylogeny) than CRISPR2.3. This variability
in CRISPR2.1 is associated with the neighboring cas genes (as
SGs relate better to cas-E than to MLST groupings), thereby hinting
at a concerted shift of these adjacent regions. This association
is also reinforced by the invariable correlation of a specific cas-E
variant and the corresponding leader. This cas-leader-array relationship
hints that multiple exchanges may have in fact occurred
involving this region, with each set of genes most likely carrying its
own leader and possibly even its own CRISPR array. This would
suggest that at least some compatible interaction between the cas
genes and the leader-repeat regions might be needed for optimal
operation. However, this relationship would not be indispensable,
as seen previously by our group (51) and also supported by the fact
that the E2 variant is functional with the substantially different
CRISPR-II leader (24). Thus, CRISPR2.1 could be considered the
most variable array, likely due to the proximity of the exchangeable
cas genes. In contrast, CRISPR-II (away from the cas genes) is
considerably more clonal, following the MLST phylogeny to a
higher degree.

Despite the spacer variability, the correspondence with the
MLST phylogeny was still significant for most strains. This finding
contrasts somewhat with previous works, where higher rates of
spacer turnover and lineage heterogeneity were found (20, 50). An
explanation for this discrepancy lies in the approach used to generate
the SGs to build the spacer tree. The methodology employed
in this work allowed us to relate strains sharing very few or even no
common spacers, thus contributing to a lesser dispersion of data
and therefore to a more coherent clustering.

With respect to the functionality of the CRISPR-Cas systems,
the CAI divergence involving most of the cas-E1 genes of certain
B1 strains suggests that, at least in this phylogenetic clade,
coevolutionary adaptations might have occurred. Although the role of
Cascade is not yet entirely characterized, a recent work has established
that, in the process of CRISPR-mediated immunity, Cse1
recruits Cas3 for DNA degradation (9). Given that CAI may be
affected by the levels of gene expression (43), it is plausible (on the
basis of our results) that a lower transcription as expected for cse1
and adjacent genes might be at least partially compensated by the
increase in cas3 expression, as inferred by the CAI of B171 and
associated strains.

Also related to CRISPR-Cas functionality and spacer diversity, 
the differences in the number of the spacers that conform SGs 
could be an indication of a significant cas modification at the 
molecular level to adapt to more efficient rates of spacer uptake. 
Alternatively, they might simply reflect a change to a more chal- 
lenging environment for those strains. It seems unlikely, however, 
that such drastic ecological change has ever occurred during the 
evolutionary history of the species in all the strains. Moreover, the 
differences in the spacer numbers that constitute the SGs could 
largely be due to the genetic exchange in those regions. Therefore, 
these values should be taken as approximate, at least for the most 
variable arrays (i.e., CRISPR2.1) and strains. In line with this variability,
strains ascribed to the same SG may come from different
hosts and geographical locations
(http://www.shigatox.net/new/reference-strains/ecor.html), as already pointed out in a previous
work performed with a comparable set of E. coli isolates (19).
Nevertheless, strains reflecting more clonal relationships at the I-E 
level concurring with the MLST phylogeny (e.g., in groups D, E, 
and F and in some A and B1 clusters) could provide better estimations
of such rates, which would also be applicable to the mainly
clonal I-F system. 

In this work, two cas-E variants (E1 and E2) were found in
E. coli. At present, functional studies have mainly been performed
with K-12 derivatives carrying cas-E2, which were shown to be
repressed by the transcriptional regulator H-NS (11, 24-26). Additionally,
the cas-E1 variant has been studied in an O157:H7
strain, where differences with respect to K-12 were observed affecting
the acquisition stage (51). Thus, the conclusions from
K-12 might not entirely apply to the more prevalent E1 system:
other strains might show different levels of activity in terms of
spacer incorporation.

The mobile nature of the CRISPR-Cas systems is hinted at by
the inconsistency of phylogenetic relationships derived from Cas
or repeat analysis and gains further support from the presence of
complete systems in plasmids and viruses, or associated with insertion
elements, that could serve as vehicles for their lateral transfer
(22, 36, 40). Moreover, it has been demonstrated that a
CRISPR-Cas locus retains functionality even across distantly related
microorganisms (36). Furthermore, evidence has been provided
supporting the replacement of CRISPR-Cas and associated
Cmr modules in crenarchaea as a result of their mobility (40). The
evidence presented here for the I-E system, in the form of recombination
at both genes and CRISPR arrays, the different intergenic
regions associated with specific variants, and the influence of insertion
elements, is indicative of such behavior.

The majority of strains that conserve at least a portion of the
CRISPR-I locus possess the same tandem, consisting of a degenerate
repeat and a particular spacer, at the iap-proximal end of the
array (19), regardless of the cas-E variant. Additionally, slightly
different degenerate repeats are present in some strains (19) of
basal MLST groups in which some or even the majority of the
members correspond to arrangement B (Fig. 4). Based on the
parsimonious substitution model depicted in Fig. 4, the pervasiveness
of this tandem can be explained by our estimation of the time
of incorporation of the CRISPR-Cas systems. This acquisition
roughly coincides with a decrease in the population size of the
species, which would have reduced its diversity (38). Furthermore,
the prevalence of the degenerate repeat might have a
functional purpose, such as facilitating CRISPR-Cas substitution.
Its presence would effectively hamper the total loss of the array
by homologous recombination with the rest of the significantly different
type 2 repeats (19, 22). Thus, strains from arrangement B
(missing the entire CRISPR-Cas cassette) might have been prevented
from reintroducing new genes and arrays as frequently as
the rest. Given this premise, the terminal degenerate repeats
observed in arrays of many other CRISPR systems could serve as
"anchor" units.

Implication of insertion elements in the genetic variability of 
the CRISPR-Cas systems. As mentioned, ISs may have played an 
important role in the removal or inactivation of cas genes and the 
replacement of I-E variants in E. coli. As an indication of the non- 
essential nature of this CRISPR-Cas subtype (11, 20), these events 
were observed in a relatively high number of E-carrying strains. 
Expendable genomic regions are usually prone not just to higher 
rates of mutation but also to being targeted by ISs (47, 52). This 
situation in I-E contrasts with the almost complete absence of
ISs within I-F (with the exception of B7A), which would
point to a more crucial role of this system, at least in those strains 
carrying a complete set of cas-F genes. Yet this relevance would be 
accompanied by a lesser degree of variability within I-F and by 
more uniform rates of spacer incorporation. For I-E, its dispens- 
ability (and therefore variability) would then reflect the opposite 
situation, that of a more versatile and adaptable system that arose 
to replace the original I-F. 

Of all the insertion elements observed in this work, IS186 and
the uncharacterized ISu are clearly of special relevance in the evolution
of the cas-E system of E. coli. While little is known
about ISu, IS186 has an affinity for GC-rich sequences
(53). Unsurprisingly, such regions can be found (i) at the IG2-IG3
boundary (IS186 insertion), (ii) at the cysH-distal part of IG1 (ISu
insertion), and (iii) mostly within the repeats of CRISPR-I and
CRISPR-II loci. Most notably, neither type 4 repeats in I-F nor the
type 2 degenerate repeats in I-E (19, 22) share these features. Therefore,
GC-rich sequences would be acting as "hot spots" for IS insertion 
that may guide the evolution, through removal, of the CRISPR- 
Cas systems. 



Concluding remarks. The study of the diversity at the 
CRISPR-Cas regions of E. coli and related organisms revealed a 
series of events taking place in the evolutionary history of the 
species, reflected in a profuse exchange among certain strains. At 
least in the I-E system, insertion elements constituted a major 
driving force for this variability, and the degenerate (anchor)
repeat may play a crucial role in preserving some of the CRISPR- 
Cas integrity. We have established the existence of two cas I-E 
variant subtypes in E. coli, each with its own leader and also asso- 
ciated with specific adjacent regions that confirm the modular 
nature of these systems. Previous studies on strain K-12, harboring
the less ameliorated I-E2 variant, have shown a reduced cas
activity that, aside from the H-NS repression, might be explained
by the marked differences in sequence with respect to I-E1. In this
regard, even the most recent analyses performed with strain
O157:H7 carrying I-E1 might be hindered by its position
within a more basal MLST cluster (E), with spacer groups having 
reduced numbers of units that might reflect a lower activity of this 
system. Additional experiments with selected strains from diverse 
MLST groups might then help researchers to better understand 
these possible differences. In any case, our findings along with 
previous results strongly suggest that the spacer variability of the 
I-E arrays would preferentially be due to recombination and gene 
exchange rather than being the result of cas activity. 

MATERIALS AND METHODS 

Growth conditions. E. coli strains cultured in this study comprised a set of 
72 natural isolates known as the ECOR collection (30). LB medium was 
used for growth, and cultures were incubated at 37°C for 12 h.

PCR and sequencing. DNA templates were extracted from cells grown 
with shaking in liquid medium. After growth, cultures were centrifuged, 
the supernatant was removed, and the cell pellet was resuspended in 1 ml 
of ultrapure (Milli-Q) water. This wash was performed a total of three
times. Cells were lysed by heating at 98°C for 10 min, and cell debris was
removed by centrifugation. Finally, the supernatant containing
the DNA was stored in aliquots at -20°C.

PCRs were conducted under standard conditions (annealing temperature
[Ta], 55°C) with Taq polymerase (Roche) on a TC-3000 thermal
cycler (Techne). Primer cysH-F (5' CGTTTTTATTTTGCGAGCAGC
3'), hybridizing at the conserved intergenic region closest to the cysH
flanking gene, was used in combination with either primer cas3E1-R
(5' TCGTCGCCCCCGTCTTTCTC 3') or primer cas3E2-R
(5' CAGATGAATATCATTTCCTTTCG 3'), both hybridizing at equivalent positions
close to the 5' end of the cas3 gene of their respective variants. PCR products
were purified with a QIAquick PCR purification kit (Qiagen). Sequencing
was performed with a BigDye Terminator cycle sequencing kit
in an ABI Prism 310 DNA sequencer, following the manufacturer's instructions
(Applied Biosystems).

Source of sequence data. Genomic sequences were retrieved from
public nucleotide databases (http://www.xbase.ac.uk/main/browse/;
http://www.ncbi.nlm.nih.gov/genomes/). In the case of ECOR strains, partial
sequences of genes used for multilocus sequence typing were downloaded
from the Environmental Research Institute, University of Cork
(http://MLST.ucc.ie; dinB, icdA, pabB, polB, putP, trpA, trpB, and uidA genes),
and from the Institut Pasteur
(http://www.pasteur.fr/recherche/genopole/PF8/mlst/; adk, fumC, gyrB, icdA, mdh, purA, and recA genes) websites.
Data on repeat number and cas and spacer content, as well as on the
presence of insertion elements in CRISPR loci of ECOR strains, were derived
from a previous study (19).

Sequence analyses. Phylogenetic analyses of nucleotide sequences (for
cas and MLST genes) were carried out with the program MEGA version 4
(54) from alignments conducted with CLUSTALW
(http://genome.jp/tools/clustalw/) and manually edited to correct mismatches. Sequence
trees were constructed using the unweighted-pair group method using
average linkages (UPGMA), with distances calculated by the Jukes-Cantor
model on a pairwise-deletion comparison that allowed the inclusion of
partial sequences. However, the lack of proper alignment of the partially
deleted cas3 of Shigella flexneri 2a strain 301 (Sf301) prevented its use in
the analyses.
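
As an illustration of the tree-building step described above, the following minimal Python sketch computes Jukes-Cantor distances with pairwise deletion and clusters them by UPGMA. It is not the MEGA implementation used in this work; the function names, the nested-tuple tree representation, and the set of symbols treated as missing data are assumptions made for the example.

```python
import math
from itertools import combinations

SKIP = set("-N?")  # assumption: symbols treated as gaps or missing data


def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned sequences (pairwise deletion)."""
    compared = mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in SKIP or b in SKIP:
            continue  # the site is dropped for this pair only
        compared += 1
        mismatches += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    p = mismatches / compared  # proportion of differing sites
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def upgma(names, dist):
    """Cluster by UPGMA; dist maps frozenset({a, b}) to a distance.

    Returns the tree topology as nested tuples (branch lengths omitted).
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the new cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


def build_tree(alignment):
    """alignment: {strain name: aligned sequence} (hypothetical input format)."""
    names = list(alignment)
    dist = {
        frozenset((a, b)): jukes_cantor(alignment[a], alignment[b])
        for a, b in combinations(names, 2)
    }
    return upgma(names, dist)
```

Because sites are discarded per pair rather than across the whole alignment, partially sequenced genes can still contribute to the distance matrix, which is the property the pairwise-deletion option was chosen for here.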

For the construction of trees based on codon usage and on spacer presence
or absence, binary clustering analyses were performed with NTSYSpc 2.0
(Exeter Software). As with the sequence data, trees were built using UPGMA.
Distances were calculated by the average taxonomic distance
model. For the generation of the matrix based on the combined binary
data of the spacers from the 5 CRISPR arrays analyzed, an iterative pro- 
cedure was used to select the characters to be considered. First, for each 
CRISPR array, only those spacers present in the highest number of strains 
(defining the spacer groups [SG]) were considered. Next, strains not in- 
cluded in any SG but sharing at least one spacer with any member within 
it were recruited. Then, each remaining ungrouped strain as well as each 
SG was considered a distinct character (i.e., all strains within each group 
were assigned the same character value). Further, the same procedure was 
later applied within each SG to define potential subgroups as new distinct 
characters, although this was done only if the new results obtained were 
different from those obtained with the original SG. A CRISPR2.1 spacer 
present in the vast majority of strains, thus not being discriminative, and 
spacers that were identical but that were located in different loci (presum- 
ably acquired in separate events) were not considered for the generation of 
SGs. For the construction of trees based on codon usage, codon usage
frequencies were determined with the Countcodon application
(http://www.kazusa.or.jp/codon/countcodon.html) and converted into a binary
matrix of characters. Either 1 or 0 was assigned to the codons of each
amino acid depending on whether the score was above or below the cutoff
value of 80% with respect to the particular maximum.
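
The conversion of codon usage into binary characters and the distance computed on such characters can be sketched as follows. This is a hedged Python illustration, not the NTSYSpc procedure itself: the function names and input layout are hypothetical, a codon scoring exactly at the cutoff is assumed to receive 1, and the average taxonomic distance is taken to be the square root of the mean squared character difference.

```python
import math


def binarize_codon_usage(freqs_by_amino_acid, cutoff=0.80):
    """Convert codon frequencies into binary characters.

    freqs_by_amino_acid: {amino acid: {codon: frequency}} (hypothetical layout).
    A codon scores 1 when its frequency reaches `cutoff` times the highest
    frequency among the codons of the same amino acid, and 0 otherwise.
    """
    characters = {}
    for codons in freqs_by_amino_acid.values():
        top = max(codons.values())
        for codon, freq in codons.items():
            characters[codon] = 1 if top > 0 and freq >= cutoff * top else 0
    return characters


def average_taxonomic_distance(x, y):
    """Square root of the mean squared difference between two character vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("character vectors must be non-empty and of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```

The same distance function applies to the spacer matrix, where each character records the presence (1) or absence (0) of a spacer group in a strain. For binary data the squared difference is simply a mismatch indicator, so the distance grows with the fraction of characters on which two strains disagree.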

Analyses of recombination at the cas-E1 sequence variants were performed
with the program GENECONV 1.81 (Department of Mathematics,
Washington University, St. Louis, MO;
http://www.math.wustl.edu/~sawyer/geneconv). Only strains bearing the complete set of cas genes
were considered for the analysis. Two strains (if present) were
selected for each main MLST cluster. In the case of the more abundant
strains from the B1 group, at least two strains were taken from those subclades
diverging more than 0.2%. For each strain included in the analysis, the
concatenated cas sequences were aligned, and the nucleotide differences
for each pair were statistically tested by the program to search for recombinational
events. Pairwise comparisons rendering a Bonferroni-corrected
Karlin-Altschul P value of less than 0.05 were deemed significant for recombination
between the two sequences.
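
The significance criterion applied to the pairwise comparisons can be illustrated with a short Python sketch of a Bonferroni correction. GENECONV reports corrected P values itself; the function below only shows the arithmetic of the correction and the 0.05 threshold, and the function name and the strain-pair labels used as keys are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Apply a Bonferroni correction to a set of pairwise P values.

    p_values: {pair label: uncorrected P value}.
    Returns {pair label: (corrected P value, significant?)}.
    """
    n_tests = len(p_values)
    results = {}
    for pair, p in p_values.items():
        corrected = min(1.0, p * n_tests)  # multiply by the number of comparisons
        results[pair] = (corrected, corrected < alpha)
    return results
```

With three comparisons, for example, an uncorrected P value of 0.02 becomes 0.06 and is no longer deemed significant, which shows how the correction guards against spurious recombination calls when many strain pairs are tested.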

IS-Finder (https://www-is.biotoul.fr/) was used for the identification 
of insertion elements. Consensus leader sequences were obtained with 
WebLogo (http://weblogo.berkeley.edu/logo.cgi). CAI values were calcu- 
lated using the application at http://genomes.urv.es/CAIcal/ (30), with the 
codon usage frequencies of the entire genome of E. coli K-12-MG1655 
(http://www.kazusa.or.jp/codon/) as a reference. Three independent sets 
of E. coli K-12 sequences with their estimated mutation rates (μ) were
selected: (i) MLST analysis (this work), (ii) lacI and his operons (55), and
(iii) a collection of randomly distributed genes (56). The CAI-log μ representation
of these genes allowed us to infer a linear regression (r² > 0.99)
which was used to extrapolate μ from the CAI of the different sets of cas
genes (cas-E1, cas-E2, and cas-F).
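
The CAI calculation and the regression-based extrapolation can be sketched in Python as follows. This is a simplified illustration rather than the CAIcal implementation: CAI is computed as the geometric mean of the relative adaptiveness of each codon with respect to a reference codon usage table, codons absent from the table are skipped, and the use of base-10 logarithms for the mutation rate is an assumption. All function names and input layouts are hypothetical.

```python
import math


def relative_adaptiveness(reference_counts, synonymous):
    """w(codon) = reference count / highest count among its synonymous codons.

    reference_counts: {codon: count in the reference genome}.
    synonymous: {amino acid: [codons]}.
    """
    w = {}
    for codons in synonymous.values():
        top = max(reference_counts[c] for c in codons)
        for c in codons:
            w[c] = reference_counts[c] / top
    return w


def cai(sequence, w):
    """Geometric mean of w over the codons of an in-frame coding sequence."""
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(w[c]) for c in codons if w.get(c, 0) > 0]
    if not logs:
        raise ValueError("no scorable codons")
    return math.exp(sum(logs) / len(logs))


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate_mu(cai_value, slope, intercept):
    """Predict a mutation rate from CAI, assuming the fit was on log10(mu)."""
    return 10 ** (slope * cai_value + intercept)
```

In use, `fit_line` would be given the CAI values of the three reference gene sets as `xs` and the logarithms of their estimated mutation rates as `ys`; the resulting slope and intercept are then passed to `extrapolate_mu` together with the CAI of each cas gene set.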

Statistical analyses. Analysis of variance (ANOVA) tests were performed
using SPSS software version 17.0 (SPSS Inc., Chicago, IL). A
P value of less than 0.05 was considered significant.

SUPPLEMENTAL MATERIAL 

Supplemental material for this article may be found at
http://mbio.asm.org/lookup/suppl/doi:10.1128/mBio.00767-13/-/DCSupplemental.